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ABSTRACT 

Markov Chain Monte Carlo (MCMC) methods have revolutionised Bayesian data anal- 
ysis over the years by making the direct computation of posterior probability densities 
feasible on modern workstations. However, the calculation of the prior predictive, the 
Bayesian evidence, has proved to be notoriously difficult with standard techniques. In 
this work a method is presented that lets one calculate the Bayesian evidence using 
nothing but the results from standard MCMC algorithms, like Metropolis-Hastings. 
This new method is compared to other methods like MultiNest, and greatly outper- 
forms the latter in several cases. One of the toy problems considered in this work is the 
analysis of mock pulsar timing data, as encountered in pulsar timing array projects. 
This method is expected to be useful as well in other problems in astrophysics, cos- 
mology and particle physics. 
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1 INTRODUCTION 

Bayesian inference has proved over the years to be a very 
I ■ powerful tool in many branches of science as it gives a very 
0^ clear prescription of how to analyse datasets without loss 
f*"*) , of information, even for very complicated models. On the 
JL^ ■ practical side, in performing Bayesian analysis two difficult 
. £^ | problems often emerge: 

■ 1. Producing the posterior probability density functions 
•_h ' (PDFs) for several interesting parameters requires marginal- 
isation, i.e. the integration of the full joint posterior PDF 
over most model parameters. In a majority of cases this must 
be done numerically, and it is common to employ a Markov 
Chain Monte-Carlo (MCMC) algorithm for performing the 
integration. 

2. Model selection requires the calculation of the Bayes fac- 
tor: the ratio of the prior predictive values of two competing 
models. This prior predictive, or Bayesian evidence as we 
will call it, is the integral of the full joint posterior PDF 
over all model parameters. The calculation of this value can 
be notoriously difficult using standard techniques. 

The use of Markov Chain Monte Carlo (MCMC) algo- 
rithms, such as the Metropolis-Hastings algorithm, has be- 
come extremely popular in the calculation of the marginal 
posterior PDFs. MCMC algorithms allow one to sample 
from posterior distribution of complicated statistical mod- 
els, greatly reducing the effort involved in evaluating high 
dimensional numerical integrals. The problem with standard 



Email: haasteren@strw.leidenuniv.nl 



MCMC algorithms lies in the fact that the normalisation 
constant of the marginal posterior PDFs is lost in the cal- 
culation, making it difficult to calculate the second of the 
above mentioned quantities. 

Over the years, several methods capable of calcu- 
lating the Bayesian evidence h ave been developed , such 
as using the harmonic mean (Newton fc Raftervl 1 19941 ) 
and parallel tempering l|Earl fc Deem! |2005l ). The prob- 
lems with these methods are the inaccuracy of the method 
for the harmonic mean, and the high computational cost 
for parallel tempering: this method can easily be 100 
times more time-consuming than regular MCMC algorithms 
IjNewman fc Barkemall 19991 ). More recent ef forts include the 
Nested sampling algorithm (|Skilling| [2004), a greatly im- 
proved version of which has been implemented in MultiNest 
jFeroz et al.ll2009l . hereafter FHB09). These recent methods 
rely on a transformation of the integral to a 1-dimensional 
problem, and then use a clever trick for avoiding the calcu- 
lation of the inverse of the cumulative posterior PDF. The 
MultiNest algorithm of FHB09 has been successfully tested 
on several (toy) problems, and seems to be much more ef- 
ficient than traditional methods. However, since the Multi- 
Nest algorithm by design samples from the prior - not from 
the posterior - we believe that the MultiNest algorithm suf- 
fers more from the curse o f dimensiona lity than the tra- 
ditional MCMC algorithms l|Evansll200"r3 ). When analysing 
complicated models with many parameters (~ 100 or more) , 
the scaling with dimensionality is crucial. 

In this paper we present a method to calculate the 
Bayesian evidence from the MCMC chains of regular MCMC 
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algorithms. This method can be applied to chains that have 
already been run, provided that the posterior values, calcu- 
lated with normalised prior and normalised likelihood, have 
been saved together with the values of the parameters at 
each point in the chain. For several cases, the accuracy of 
the new method is significantly greater than that of other 
methods with the same amount of sampled points, provided 
that the MCMC has been properly run. The error on the 
Bayesian evidence can be reliably calculated using a boot- 
strap procedure. 

The outline of the paper is as follows. In section [2] we 
briefly review the basic aspects of Bayesian inference for pa- 
rameter estimation and model selection. Then we provide 
the reader with some necessary details on MCMC in sec- 
tion |3] where we also outline the new algorithm to calculate 
the Bayesian evidence. In section [4] we assess the strengths 
and weaknesses of the new method relative to those that use 
nested sampling or parallel tempering. In section [S] we test 
all competing algorithms on some toy problems like the anal- 
ysis of pulsar timing data as encountered in pulsar timing 
array projects. 



2 BAYESIAN INFERENCE 

Bayesian inference methods provide a clear and consistent 
approach to parameter estimation and model selection. Con- 
sider a model, or Hypothesis, H with parameters 9 for a 
dataset d. Then Bayes' theorem states that 

p(e\d,H) = -i (i) 

V ' P[d\H) 

where P(9) := P(9 j d, H) is the posterior PDF of the 
parameters, L{&) := P(d | Q,H) is the likelihood function, 
7r(9) := P(9 | H) is the prior, and z := P(d \ H) is the 
prior predictive or Bayesian evidence as we call it. 

The Bayesian evidence is the factor required to nor- 
malise the posterior over 9: 

z = J L (<E») TV (9) d m 9, (2) 

where m is the dimensionality of 9. This Bayesian evidence 
can then be used to calculate the so-called odds ratio in 
favor of model Hi over Ho, which allows one to perform 
model selection: 

p(H \d) *>P{H )> (> 

where P(Ho) and P(Hi) are the prior probabilities for the 
different models . As with the prior for a model parameters, 
the prior probability for a model should be chosen to reflect 
the available information. 

In traditional MCMC methods as used in parameter es- 
timation problems, inferences are obtained by taking sam- 
ples from a distribution that is proportional to the poste- 
rior PDF. Therefore one usually ignores the normalisation 
constant. This is acceptable since these methods lose the in- 
formation about the normalisation anyway. But in contrast 
to parameter estimation problems, in model selection the 



Bayesian evidence plays a central role: it is a measure for 
how well the data supports the model. 

The average of the likelihood over the prior distribution, 
the evidence, is larger for a model if more of its parameter 
space is likely and smaller for a model with large areas in 
its parameter space having low likelihood values. Even if the 
likelihood function has high peaks, in order to increase the 
evidence these peaks must compensate for the areas in its 
parameter space where the likelihood is low. Thus the ev- 
idence automatically implements Occam's razor: a simpler 
theory with compact parameter space will have a larger ev- 
idence than a more complicated model, unless the latter is 
significantly better at explaining the data. 



3 MARKOV CHAIN MONTE CARLO 

Markov Chain Monte Carlo methods (MCMC) can be used 
to sample from very complicated, high dimensional distribu- 
tions; for Bayesian inference it is the posterior, but it could 
be any distribution. The method presented in this paper 
could be useful for integration problems other than Bayesian 
evidence calculation, so we use the more general /(9) to de- 
note this function. The samples drawn from this function 
can then be used to perform the integrals we need for the 
marginal posterior PDFs and, as we show, for the Bayesian 
evidence. The idea is quite straightforward: a large num- 
ber of samples drawn from a distribution proportional to 
/(9) will end up being distributed with a sample density 
proportional to /(9). The exact mechanism that produces 
these samples can differ between MCMC methods and is 
irrelevant for the purposes of this paper, but the result is 
always a large number of samples distributed according to 
/(9). The main advantage of this is that we do not have 
to sample from the entire volume of the parameter space or 
the prior, but only from a small fraction of it: the fraction 
where the function has a high value. Especially for functions 
in high-dimensional parameter spaces this feature is crucial. 

In section 13.11 we show that all information needed to 
calculate the Bayesian evidence is already present in the 
samples of the MCMC. Then we give a practical example of 
how to calculate the integral in practice in section [3T31 

3.1 Volume calculation 

Consider the unnormalised distribution: 

/ (as, y) = exp (-ax 2 - by 2 ) , (4) 

where a and b are arbitrary model parameters. For the val- 
ues a = 1/5 and b — 2/5, a Metropolis algorithm with 
N — 40000 samples yield a distribution of samples similar 
to Figure [T] We use the notation 9; = (xi, yi) to indicate 
the sample. We would now like to calculate the integral 

/ = j J f(x,y)dxdy. (5) 

In MCMC algorithms, it is usually required or straightfor- 
ward to calculate the function values /(9;) for each sample, 
and these values are often stored together with the values of 
the parameters. These function values can be used to calcu- 
late the integral if we treat the MCMC samples in parameter 
space as an irregular grid. The most obvious and exact way 
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Density plot of function 




Figure 1. A scatter plot of of 40000 samples, drawn using a 
Metropolis algorithm from the function f(x, y) = exp(— ax 2 — 
by 2 ), with a = 1/5 and b = 2/5. We have included the burn-in 
period in this plot, even though that part of the chain should be 
discarded for any further calculations. 



Voronoi tessellation of Gaussian distributed samples 




Figure 2. An example of a Voronoi tessellation. Here we have 
taken 200 samples from a 2-dimensional Gaussian distribution as 
the centres of the Voronoi diagram. 



to do this is to calculate the Voronoi tessellation for the sam- 
ples in the parameter space, an example of which is given in 
Figure [2] The samples are then the centres of the Voronoi 
cells, and the integral can be calculated as follows: 



/«£/ ©Y)a 



(6) 



,th 



where Oi is the area of the i Voronoi cell, and we only 
sum over closed cells (with finite area). Theoretically, this 
procedure converges to the correct value in all cases for large 
number of samples N. However, although Voronoi tessella- 
tions can be computed for any dimensionality, this becomes 
computationally pro hibitive in practice for pro blems with 
high dimensionality ( Edelsbmnncr & Shah 1996). This pro- 



cedure does illustrate that all information needed to evaluate 
the integral / is present in the MCMC chain. 

3.2 Estimating the covered parameter volume 



As we have shown in section I3TT1 the integral / can be ap- 
proximated by the summation over the function values /(Oi) 
times the Voronoi cell size Oi. The value Oi is the parame- 
ter volume occupied by the Voronoi cell of the i^ 1 MCMC 
sample, so in the rest of this paper we call Oi the parameter 
volume of the i sample. Similarly, the parameter volume 
of the entire chain can now be approximated by: 



(7) 



where O is the parameter volume of the entire MCMC chain, 
and N is the number of samples in the chain. 

We now show that we do not need the Voronoi tessella- 
tions to calculate the parameter volume and the integral. For 
the rest of this section we regard the more general case that 
we have an n-dimensional parameter space, instead of the 
2-dimensional case of the above example. In any MCMC al- 
gorithm, the density of points of the chain at a certain point 
Oi in parameter space will become proportional to the func- 
tion value /(Oi) for large N. Therefore, 



lim Oi = 



f ©* 



(8) 



where a is a proportionality constant to be determined. If 
we somehow can estimate the constant a for this chain, we 
can immediately calculate the integral by I = Na. 

The proportionality constant a can be estimated for a 
certain chain by regarding a small subset C R n of the 
parameter space for which: 

1. we can calculate what the volume V\ of the small subset 
is. 

2. the subset is sufficiently and completely populated by 
MCMC samples: all parts of F^ should have a sample den- 
sity proportional to /(©) 

MCMC algorithms are employed in order to avoid having to 
draw samples from the entire parameter space, so we have 
to be careful to choose the right test subset in order to ful- 
fill requirement 2. If we have found such a subset F^, the 
following equation allows us to estimate a: 



V t = a 

e^ t /(©*)' 
The calculation of the integral is now trivial: 
I = Na. 



3.3 A practical algorithm 



(9) 



(10) 



In this section we construct a simple practical algorithm 
for numerically calculating integrals using regular MCMC 
methods. As stated in section [3721 we need to define a small 
subset of the parameter space that is sufficiently populated 
by MCMC samples. The most obvious choice would be to 
look for a peak in the distribution. In the case of mul- 
timodal posteriors, we should take samples that 'belong' 
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to one specific mode. Clustering algorit hms like X-means 
dPelleg fc Moorei |2000| ). and G-means (jHamerlv fe Elkanl 
120031 ) can be used for this. 

Assuming that we have found M samples that belong to 
a specific mode, we can then define a subset as follows. 
Assume that the M samples are sorted according to their 
function value /(Bj), with /(Bo) the highest value. We then 
first approximate the maximum of /(6) as: 



1 k 



k 



(11) 



where k = aM is a small fraction of M, dependent on the 
shape of the mode. We use a = 1/20 in several cases. We 
now define our subset F as an ellipsoid with centre /2, and 
covariance matrix C, defined by: 



C 



-1 71 

n f—f V 



P) [Qi — P 



(12) 



where n = bM is a fraction of M, again dependent on the 
shape of the mode. We use b = 1/5 in several cases. All 
samples within this ellipsoid of size V r' 2 det C satisfy: 



e 



C" 



e 



< r 



(13) 



We adjust the parameter r 2 such that I = cM samples sat- 
isfy this relation. It is crucial that we choose I to be as large 
as possible, while still satisfying the requirement that the 
entire ellipsoid is sufficiently populated with samples. If I is 
too small, we will not achieve very high accuracy for our in- 
tegral /, but if I is too large, we have underpopulated parts 
of the ellipsoid which results in the wrong outcome. We use 
c = 1/3 in several cases. 

We now have all the ingredients needed to calculate the 
integral /, as the Volume of our fe-dimensional subset is given 
by: 



r 7T 2 

rfi + l 



VdetC. 



(14) 



This, combined with equations © and (|10[) allows us to 
evaluate the integral. 

We would like to note that this prescription is merely 
one of many that could be used. In this case we have defined 
our subset F^ as an ellipsoid located at a maximum of our 
integrand, but any other subset that meets the criteria of 
section [3721 will suffice. 



3.4 Error estimates 

A MCMC algorithm generates samples according to the dis- 
tribution /(O). Consider the amount of samples inside the 
subset F^. This amount follows a Poissonian distribution 
with mean and variance equal to cM. We thus obtain the 
theoretical estimate for the error in the integral: 



AI = 



(15) 



As we show in section 15.11 this estimate is reliable for 
Monte Carlo algorithms that yield uncorrelated samples. In 
practice, however, many MCMC algorithms produce cor- 
related samples since a whole chain of samples is used 



^Roberts et alfi .997) . As we show in section[5j the use of cor- 
related MCMC samples for numerical integral as proposed 
in this paper invalidates the error estimate of equation (|15p . 

Having efficiency as one of our goals, we would like to 
produce reliable error-bars on our numerical integral using 
the correlated MCMC sam ples. We propose to use a boot- 
strap method (Efronl [l979l ). In the case of several chains, 
one can use the spread in the estimates based on different 
chains to estimate the error of the integral. In the case of a 
single chain, we propose to divide the chain in 10 succeeding 
parts, and use the spread of the integral in those 10 parts to 
estimate the error of the numerical integral. In section [57 
we test the error estimates discussed here extensively. 



4 COMPARISON TO OTHER METHODS 

Although the formalism developed in this paper is quite gen- 
erally applicable, in practice there are caveats for each in- 
tegration problem that one needs to be aware of in order 
to choose the right integration algorithm. In this section we 
discuss the strengths and the weaknesses of the method de- 
veloped in this paper, and we compare it to Nested sampling, 
and Parallel tempering. 

For all algorithms, the main criteria that we cover are 
efficiency, accuracy and robustness. 



4.1 Method using traditional MCMC 

The main problem that MCMC algorithms try to tackle is 
that of parameter space reduction; the full parameter space 
over which we would like to evaluate the integral is too 
large, which is why we cannot use a regular grid. The idea 
behind MCMC algorithms is that of importance sampling: 
only draw samples in the regions of parameter space where 
the function value /(B) is high enough to significantly con- 
tribute to the integral. This can be done by drawing samples 
from a distribution P(0) that resembles the function /(B), 
with the optimal choice being P(6) = /(B). MCMC are 
specifically designed to satisfy this optimal relation, mak- 
ing MCMC me thods optimal l y effi cient from a theoretical 
point of view (jRoberts et al. Hl997h . In Figure [JJ we show 
an example of samples generated with a Metropolis MCMC 
algorithm, released on a 2-dimensional Gaussian function. 

The drawback of MCMC algorithms is that due to 
the parameter space reduction one can never be sure that 
all important parts of the parameter space have been cov- 
ered. Some functions have many distinct peaks, the so-called 
modes of the function. MCMC algorithms often have diffi- 
culty to make the chain move from one mode to another. If 
the function /(B) is highly multimodal, we must make sure 
that the sample density ratio between the modes is right and 
that we have reached all the important modes in the entire 
parameter space. 

An additional practical challenge with the method de- 
veloped in this paper is to make sure that we have con- 
structed a subset of the parameter space that is sufficiently 
populated with samples. In the case of a not very peaked, 
or highly degenerate function this requires special care. 
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4.2 Nested sampling 

The nested sampling algorithm is a Monte Carlo technique 
aimed at accurat e evaluation o f numerical integrals, while 
staying efficient (Skilling 2004). The algorithm solves the 
problems of regular MCMC algorithms by starting with 
sampling from the original parameter space. In the case of 
Bayesian evidence calculation this is equivalent to sampling 
from the prior distribution of the parameters. The density 
with which the parameter space is sampled is adjustable in 
the form of an amount ul of so-called live points; the number 
of points evenly distributed among the part of the parame- 
ter space we are exploring. At the start of the algorithm, the 
live points are evenly distributed over the entire parameter 
space. Then parameter space reduction is achieved by re- 
placing the live points one-by-one under the restriction that 
the newly sampled live points are higher than the lowest 
one we have not replaced yet. Effectively this is the same as 
shrinking the parameter space by a fixed factor every time 
we replace a new live point, ultimately sampling only the 
part of the function close to the maximum. 

A big advantage of the nested sampling algorithm is 
that one generates samples directly from the whole param- 
eter space. If n,L is chosen high enough, i.e. there is a high 
enough sampling density in the parameter space, there is a 
very low probability that a mode of the distribution /(0) 
will be missed. This is one of the difficulties in many MCMC 
implement at ions . 

A big disadvantage of the nested sampling algorithm 
is that one directly generates samples from the whole pa- 
rameter space, instead of from a distribution that more 
closely resembles the function /(0) we want to integrate. 
This method will therefore never reach the efficiency that 
traditional MCMC methods offer. In Figure [3] we show an 
example of samples generated with a nested sampling al- 
gorithm, applied to a 2-dimensional Gaussian function. In 
higher dimensional problems, as discussed in section [57T1 the 
direct sampling from the whole parameter space can lead to 
serious efficiency problems that need to be solved. 



4.3 Parallel tempering 

Parallel tempering (PT) is an algorithm spawned from the 
desire to solve the problems of traditional MCMC methods. 
PT algorithms possess better mixing properties, allowing the 
chain to "escape" local extrema, and allow one to calculate 
the complete integr al, or in our case the Bayesian evidence 
<|Earl fc Deeml2 005). Let us briefly review PT in this section, 
without discussing it in too much detail. 

The main idea of PT is that of parameter space explo- 
ration by adding an imaginary inverse temperature (3 to the 
system, changing the integrand of our integral to: 



Density ploi af function 



U (e) = (/(e)) ' 



(16) 



Then many MCMC chains are released in the parameter 
space, each with a different temperature 6 [0, 1]. A clever 
swapping system is employed, allowing the chain with (5 — 1 
- the "true" chain - to swap parameters with chains of higher 
temperature every now and then provided that such a high- 
temperature chain was able to reach a point in parameter 
space with high /(O). This trick allows the coldest system, 
the "true" chain with f3 = 1, to escape local extrema. 




Figure 3. A scatter plot of 40000 samples, drawn using a 
nested sampling algorithm with 5000 live points from the function 
f(x, y) = exp(— ax 2 — by 2 ), with a = 1/5 and b = 2/5. 



The integral over /(C) is calculated by using all chains, 
not just the one with p = 1, as follows. We first define a 
partition function: 



Z{p) 



de/^ e 



(17) 



which has a logarithmic derivative of: 

d |io g (^)) = ^y d |^) = (^(/(e))>, ) C") 

where is the expectation value of a quantity over a dis- 
tribution proportional to fp(€)). Since we know that our 
desired integral can be expressed as I = Z(l), equation (|18l) 
is all we need to calculate it: 

log (Z(l)) = log (Z(0)) + J d/3 (log (,f(9)) ) p . (19) 

The observant reader will have noted that we have neglected 
to mention the size Z(0) of the parameter space that we 
explore with our chains. The high temperature chain with 
P = is unbounded by the function f(Q), and therefore will 
transverse the entire parameter space. We should make sure 
that we limit the size of the parameter space as much as 
possible, without missing any peaks of /(O). 

The main advantages of Parallel tempering are that it 
explores the entire parameter space, even in the presence 
of strong local peaks, and that the Bayesian evidence can 
be calculated. These advantages are accompanied with the 
large computational costs of the extra chains with j3 =^ 1, 
which has resulted in the need for alternative methods like 
MultiNest. As MultiNest is supposed to outperform PT in 
virtually all cases (FHB09), we will compare our method to 
MultiNest only in section [5] 



5 APPLICATIONS AND TESTS 

In this section we consider several toy models, and we apply 
several integration algorithms to these toy models as to com- 
pare the algorithms. In this whole section we have two ways 
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to use the MCMC algorithm. To produce correlated sam- 
ples, we use a regular Metropolis-Hastings algorithm where 
we use the entire chain. To produce uncorrelated samples, 
we use the same algorithm, but we only store one in every 
j samples produced by the algorithm. This does require us 
then to run the chain j times longer to produce the same 
number of samples. We choose j high enough that there is 
negligible correlation between 2 succeeding used samples; in 
the case of a n-dimensional Gaussian as in section 15.11 we 
use j = 100. 



5.1 Toy model 1: a high-dimensional Gaussian 

We first consider a problem that is a typical example of what 
MCMC algorithms where designed for: a highly peaked, 
high-dimensional function. The curse of dimensionality pro- 
hibits any direct numerical integration scheme on a fine grid, 
but analytical integration is possible. Consider the multipli- 
cation of n Gaussian functions, 

M*)=nv^ ex p(-^y ( 2 °) 

i=l " \ / 

with a,: the width of the Gaussian in the i direction. 
Now let us perform a volume preserving coordinate trans- 
formation using a random orthogonal matrix R as follows: 
Q — Rx. If we introduce a matrix A^ 1 = a% = 1 + i with 
A^ 1 — for i 7^ j, we then have a highly degenerate high- 
dimensional Gaussian function: 

h (e) = . 1 , exp f -ie T c" 1 e ) , (21) 

V J ( 2 ^)"/ 2 VdetC V 2 J' K ' 

where we have introduced the coherence matrix C = RAR T . 

We now apply in turn the MultiNest algorithm and 
the algorithm developed in this paper combined with 
Metropolis-Hastings to the multi-dimensional Gaussian for 
various number of dimensions. 

For the MultiNest algorithm, we use a number of live 
points L = 50n with n the number of dimensions, and we 
have set the sampling efficiency to e = 0.3 and the evidence 
tolerance to t = 0.1 as advocated in FHBofl 

We set up the Metropolis-Hastings algorithm to have 
an acceptance ratio equal to the optimal value of 23.4% 
l|Roberts et al.i ri997). The parameters of the algorithm ad- 
vocated in section [3.31 have been set to: a = 1/20, b — 
1/5, c = 1/3. We have used the number of samples N used 
by the MultiNest algorithm as the number of samples in our 
MCMC chain. However, we realise that the MultiNest algo- 
rithm might be improved in the future, as the algorithm is 
still under construction. The lowest efficiency (used samples 
/ drawn samples) we encountered for this toy-problem was 
e = 0.08; we therefore estimate that the error-bars could be 
decreased with a factor of maximally ^/l/0.08 ~ 3.5. The 
results of this toy-model are shown in table [1] 



1 We have not used wrapping for the parameters. Wrapping does 
speed up the sampling, but somehow te resulting evidence esti- 
mates are less accurate 



log (I) for different algorithms 



n 


# N 


MultiNest 


Unc. MCMC 


Cor. MCMC 


2 


2902 


-0.17 ±0.18 


-0.018 ±0.025 


0.03 ±0.025 


4 


7359 


0.20 ±0.17 


0.007 ± 0.024 


-0.01 ±0.03 


8 


24540 


0.17 ±0.17 


-0.01 ±0.01 


0.02 ±0.03 


16 


10 5 


0.05 ± 0.18 


0.001 ±0.006 


0.004 ±0.03 


32 


10 6 


1.43 ± 0.17 


0.004 ± 0.004 


-0.015 ±0.010 


64 


4.10 6 




-0.0004 ± 0.0007 


-0.02 ±0.016 


128 


10 7 




0.0006 ± 0.0004 


-0.05 ±0.03 



Table 1. The log-integral values of the function /i of equation 
121> . N is the number of samples, and n is the number of dimen- 
sions. The analytically integrated value is log 1 = for all values 
of n. For n 64, we were not able to successfully complete the 
MultiNest algorithm with said parameters due to limited com- 
putational power. Memory size was the limiting factor for the 
regular MCMC. 



i°g(f z ) 




Figure 4. Toy-model 2: the eggbox of equation (1221) 

5.2 Toy model 2: egg-box function 

Just as in FHB09, we now consider a highly multimodal 
two-dimensional problem for which the function resembles 
an egg-box. The function is defined as: 

f 2 (e) = exp [2 + cos (6i) cos (e 2 )] 5 , (22) 

where we set the domain of the function equal to [0, 107r] 
for both parameters. The shape of this function is shown 
in Figure [3] This is a typical problem where difficulties 
arise for traditional MCMC algorithms . Many solutions have 
been proposed for situations like this (|Newman fc Barkemal 
1999), but in practice one needs to have additional infor- 
mation about the problem for any of those solutions to be 
reliable. For the sake of clarity of this paper, we do not con- 
cern us with the practical implementation of the MCMC 
algorithm. We assume that a suitable trick can be found for 
the problem at hand so that the algorithm proposed in this 
paper can be used. For the eggbox toy-model we will use a 
jump-technique. At each iteration of the MCMC algorithm, 
there is a small probability, 1% in this case, that the chain 
will jump to a random neighbouring mode. 

The MultiNest algorithm is ideally suited for this prob- 
lem, as this is a low-dimensional multimodal function. With 
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log (I) for different algorithms 



# N 


MultiNest 


Unc. MCMC 


Cor. MCMC 


60210 


240.19 ± 0.05 


240.19 ±0.027 


240.23 ± 0.05 



Table 2. The log-integral values of a single mode of the eggbox 
function of equation (1 22 I t The fine-grid integrated value is log / = 
240.224 



enough live samples all modes should be easily discovered, 
and the peaks are symmetric and well-behaved. We run 
MultiNest on this problem with the same parameters as in 
FHB09: We use L = 2000 live points, efficiency e = 0.3, and 
tolerance t = 0.1. 

Table [5] shows the results of this analysifl. For evalu- 
ating the integral with the MCMC chain, we have taken a 
total of N = 60210 samples as was done with MultiNest, but 
we have used only the samples of one single peak in equa- 
tion ([9]). The amount of samples in a single peak is 2/25 
of the total amount of samples, leading to loss of accuracy. 
Though more sophisticated methods can be constructed by, 
say, averaging the values of the proportionality constant a of 
equation (|8) for all individual peaks, we show here that we 
only need to find one single portion of the parameter space 
that is sufficiently populated to calculate a reliable value for 
the integral. 

5.3 Application in Pulsar Timing 

In pulsar timing data analysis, one often encounters datasets 
of which the exact statistics are not well known. Bayesian 
model selection would provide the ideal tool to obtain in- 
formation about th e pow er spectra present in the data, 
van Haast eren et al. | (|2009l . hereafter vHLML) give a full de- 
scription of the Bayesian data analysis for pulsar timing ar- 
rays, but their work lacks a method to calculate the Bayesian 
evidence from the MCMC samples. In this section, we use a 
pulsar timing mock dataset to show that the method devel- 
oped in this paper is well-suited for Bayesian evidence cal- 
culation in pulsar timing problems. We also use MultiNest 
to analyse this dataset, and we compare the two results. 

For the purposes of this paper, the description of the 
pulsar timing posterior distribution is kept brief; full de- 
tails can be found in vHLML. The data of pulsar timing 
experiments consists of the arrival times of pulsar pulses 
(TO As), which arrive at the earth at highly predictable mo- 
ments in time (|Hobbs et alj |2006). The deviations from the 
theoretically predicted values of these TOAs are called the 
timing- residuals (TRs). These TRs are the data we concern 
ourselves with in this example. 

Consider n — 100 TRs, denoted as St, observed with in- 
tervals between succeeding observations of 5 weeks. Assume 
that we are observing a stable and precise millisecond pul- 
sar with timing accuracy about a — 100ns (the error bars on 
the TRs) . Usually a not precisely known, since pulsar timers 
generally assume that their estimate of the error bar is 
slightly off. Several datasets of millisecond pulsars also seem 

2 The true value is different from the one noted in FHB09. We 
have taken the hypercube transformation into account in our cal- 
culations. 
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Figure 5. The mock timing-residuals we have analysed with both 
MultiNest and the method developed in this paper 

to co ntain correlated low frequency noise l|Verbiest et all 
2009). We therefore also allow for some correlated timing- 
noise in the data, with a power-spectrum given by: 

S(f) =r 2 7 exp(- 7 /), (23) 

where / is the frequency, r is the amplitude of the correlated 
timing-noise in ns, and 7 is the typical size of the structures 
that appear in the data due to this correlated timing-noise. 
Following vHLML, we can now write the likelihood function 
for the TRs as a multi-dimensional Gaussian: 

P (ft I a,r yl ) = 1 = exp ( —ffc^ft j , 

V ' ' ' V N /(2^)" detC V 2 J 

(24) 

where C is an (n x x) matrix, with elements defined as: 

dj =a 2 5 tJ +r 2 ,f (25) 

with Sij the Kronecker delta, and m is the time difference 
between observation i and observation j. 

Simulating a mock dataset from such a Gaussian dis- 
tribution is quite straightforward; for details see vHLML. 
We now analyse a mock dataset, shown in Figure [5j gener- 
ated with parameters: a = 100ns, r — 100ns, and 7 = 2yr. 
We assume uniform prior distributions for all parameters: 
a £ [0, 1000]ns, r G [0, 1000]ns, and 7 € [0, 10]yr. The poste- 
rior is then sampled using both MultiNest and a Metropolis- 
Hastings algorithm, resulting in marginalised posterior dis- 
tributions as shown in Figure[6]&[7] Bayesian evidence values 
are in good agreement between the two methods: 

zmcmc = exp (1523.12 ±0.17) 
^MultiNest = exp (1522.93 ±0.15). (26) 

For both methods, the same number of samples has been 
used: N = 9617. 

5.4 Precision tests 

We now test the accuracy of the algorithm by running it 
many times on the same toy-problem, and then considering 
the statistics of the ensemble. We found the 16-dimensional 
Gaussian of section r5.ll to be an illustrative example. Just 
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Posterior distribution of o 



Ensemble of integrals of 16-D Gaussian 




1e-07 
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Figure 6. The marginalised posterior of the a parameter, sam- 
pled using a Metropolis-Hastings MCMC method. 



Joint r-y posterior distribution 
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Figure 7. The marginalised posterior of the r and 7 parameters, 
sampled using a Metropolis-Hastings MCMC method. 



as in table[T] we take N = 10 5 and c = 0.3, and then we run 
n = 10 4 Metropolis-Hastings chains on this toy-problem. For 
the i chain we then calculate the integral Ii and bootstrap 
error estimate cr BS . We have presented the results of this 
analysis as a histogram of Ii values in Figure[8] Several useful 
quantities that characterise the ensemble are: 



1 " 

/ = - V Ii = 0.980 

r> — ^ 



\ 



1 n 

-E( J '- J T = - 028 



1 n 



(27) 



where I is the integral average, a is the rms of the integral 
values, and <r BS is the rms value of the bootstrap errors. 

Figure[5]shows that the bootstrap error estimate is quite 
correct, since a ~ a BS . However, though smaller than a, 
there is a significant deviation in the value of of I compared 




Calculated value integral 

Figure 8. Histogram of the frequency of calculated integral us- 
ing the method developed in this paper. We have taken the 16- 
dimensional Gaussian of equation (121 ft as integrand. Here we have 
analysed it with N = 10 5 , and c = 0.3. This histogram has mean 
and standard deviation: / = 0.980, a = 0.028. The rms of the 
bootstrap error of the ensemble was cf BS = 0.027 



to the true value I = 1. As we have discussed in section 
the criterion for the algorithm to converge to the correct 
value is that the small subset is sufficiently populated by 
MCMC samples with density proportional to /(0) _1 . 

In order to test whether the deviation of I from the true 
value / = 1 is due to not fulfilling requirement 2 of section 
13.21 we will perform 3 additional tests. In all 3 cases, we will 
construct a new ensemble of MCMC chains identical to the 
ensemble above, except for one parameter. The differences 
with the above mentioned ensemble are: 

1. Instead of a correlated MCMC chain, we use an MCMC 
chain of uncorrelated samples, produced by performing a 
regular MCMC but only storing every lOO*' 1 sample. 

2. N = 10 7 , instead of N = 10 5 . This results in much more 
samples in the subset F^. 

3. c = 0.7, instead of c = 0.3, which also results in more 
samples in the subset F^. 

We present the results of this analysis as the values of 
equation (|27p in table [3] All adjustments seem to improve 
the precision of the algorithm. Several notes are in order: 

1. The only reason we can increase c is because we know ex- 
actly what the integrand looks like. In practical applications 
this is probably not an option. Also note that bootstrap er- 
ror increases, indicating that estimate of the integral is less 
stable. 

2. The fact that the uncorrelated chain performs as well 
as theoretically possible according equation (|15[1 shows that 
the algorithm suffers significantly under the use of correlated 
MCMC chains. 

3. Increasing the amount of samples in a chain makes the 
calculated integral more reliable, as the subset F^ is more 
densely populated. Note that this large chain is build up of 
small chains that would yield a biased value. 
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Ensemble statistics for different parameters 



Cor./Unc. 


N 


c 




17 


a BS 


Cor. 


to 5 


0.3 


0.980 


0.028 


0.027 


Unc. 


10 5 


0.3 


1.000 


0.006 


0.006 


Cor. 


to 7 


0.3 


1.000 


0.003 


0.003 


Cor. 


10 s 


0.7 


0.994 


0.020 


0.034 



Table 3. The statistics of equation (1276 for various ensembles of 
n = 10 4 MCMC runs on a 16-dimensional Gaussian. The first 
chain has the same parameters as used in section 15.11 The other 
chains differ in either number of samples per chain N, the size 
c of the subset F^, or whether or not the samples in a chain are 
correlated. Note that the error in the uncorrelated chain equals 
the theoretical Poissonian limit of a = v/l/ciV = 0.006. 

6 DISCUSSION AND CONCLUSIONS 

We develop and test a new algorithm that uses traditional 
MCMC methods to accurately evaluate numerical integrals 
that typically arise when evaluating the Bayesian evidence. 
The new method can be applied to MCMC chains that have 
already been run in the past so that no new samples have to 
be drawn from the integrand. We test the new algorithm on 
several toy-problems, and we compare the results to other 
algorithms: MultiNest and Parallel tempering. We conclude 
that there is no single algorithm that outperforms all others 
in all cases. High-dimensional, peaked problems are better 
tackled using a traditional MCMC method, whereas very 
complicated multimodal problems are probably best han- 
dled with MultiNest. When applicable, the new algorithm 
significantly outperforms other algorithms, provided that 
the MCMC has been properly executed. This new algorithm 
is therefore expected to be useful in astrophysics, cosmology 
and particle physics. 

We have demonstrated that the new algorithm suffers 
under the use of correlated MCMC chains, produced by 
using the entire chain of a particular MCMC method like 
Metropolis-Hastings. If the new algorithm is used in combi- 
nation with an uncorrelated chain, the accuracy of the nu- 
merical integral can reach the theoretical limit for stochastic 
methods: a — lj\TN , with a the uncertainty, / the value of 
the integral, and N the amount of MCMC samples. Using 
correlated MCMC samples can significantly increase the in- 
tegral uncertainty, and longer MCMC chains are needed for 
the integral to converge. Additional tests to assess conver- 
gence are required. 
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6.1 Comparison to other work 

When this paper was alread y subm itted to Monthly No- 
tices, a preprint by IWeinber g] ( |2009l ) appeared on the arxiv 
which also attempts to construct an algorithm to calculate 
the Bayesian evidence using MCMC samples of the poste- 
rior. Weinberg first discusses the flaws of the harmonic mean 
estimator, and then proposes 2 algorithms to overcome these 
flaws. Weinberg then tests these interesting modifications to 
the harmonic mean estimator on simulated datasets. The al- 
gorithm proposed by Weinberg bears some resemblance to 
the algorithm in this paper in that one should only use a 
well-sampled subset of the parameter space when estimat- 
ing the Bayesian evidence. 



